"dataset_name": "machine_learning"
"description": "The following are multiple choice questions (with answers) about machine\
  \ learning.\n\nQ: Which image data augmentation is most common for natural images?\n\
  (A) random crop and horizontal flip (B) random crop and vertical flip (C) posterization\
  \ (D) dithering\nA: Let's think step by step. Data augmentation is used to increase\
  \ the diversity of images in the training dataset. It is important that natural\
  \ images are kept natural after being augmented. Vertical flips of images are not\
  \ natural, so (B) is false. Posterization reduces an image to a few flat color\
  \ bands and dithering simulates extra color depth with noise patterns; neither\
  \ of these preserves the natural look of the image.\
  \ The only natural data augmentation technique is (A). The answer is (A).\n\nQ:\
  \ Traditionally, when we have a real-valued input attribute during decision-tree\
  \ learning we consider a binary split according to whether the attribute is above\
  \ or below some threshold. Pat suggests that instead we should just have a multiway\
  \ split with one branch for each of the distinct values of the attribute. From the\
  \ list below choose the single biggest problem with Pat’s suggestion:\n(A) It is\
  \ too computationally expensive. (B) It would probably result in a decision tree\
  \ that scores badly on the training set and a test set. (C) It would probably result\
  \ in a decision tree that scores well on the training set but badly on a test set.\
  \ (D) It would probably result in a decision tree that scores well on a test set\
  \ but badly on a training set.\nA: Let's think step by step. Because the input is\
  \ real-valued, it is unlikely that the same values appear both at training and test\
  \ time. This means that while such a decision tree could yield good performance\
  \ on the training data, when evaluated on the test data it will perform badly because\
  \ the decision tree won’t know what to do with numbers that did not appear in the\
  \ training data. The answer is (C).\n\nQ: You are reviewing papers for the World’s\
  \ Fanciest Machine Learning Conference, and you see submissions with the following\
  \ claims. Which ones would you consider accepting?\n(A) My method achieves a training\
  \ error lower than all previous methods! (B) My method achieves a test error lower\
  \ than all previous methods! (Footnote: When regularisation parameter λ is chosen\
  \ so as to minimise test error.) (C) My method achieves a test error lower than\
  \ all previous methods! (Footnote: When regularisation parameter λ is chosen so\
  \ as to minimise cross-validation error.) (D) My method achieves a cross-validation\
  \ error lower than all previous methods! (Footnote: When regularisation parameter\
  \ λ is chosen so as to minimise cross-validation error.)\nA: Let's think step by\
  \ step. The training error can be made arbitrarily low by overfitting the training\
  \ data, so (A) can’t be right. Hyperparameters such as λ should be chosen to minimise\
  \ the cross-validation error, not the test error, so (B) follows the wrong procedure,\
  \ while (D) reports the cross-validation error instead of the held-out test error.\
  \ The only claim with both the right procedure and the right metric is (C). The answer\
  \ is (C).\n\nQ: A 6-sided die is rolled 15 times and the results are: side 1 comes\
  \ up 0 times; side 2: 1 time; side 3: 2 times; side 4: 3 times; side 5: 4 times;\
  \ side 6: 5 times. Based on these results, what is the probability of side 3 coming\
  \ up when using Add-1 Smoothing?\n(A) 2.0/15 (B) 1.0/7 (C) 3.0/16 (D) 1.0/5\nA:\
  \ Let's think step by step. Add-1 smoothing adds one to each count and then\
  \ normalizes the resulting probabilities accordingly. The counts after adding\
  \ one will be: side 1 comes up 1 time; side 2: 2 times; side 3: 3 times; side 4:\
  \ 4 times; side 5: 5 times; side 6: 6 times. The total of the smoothed counts is\
  \ then 15 + 6 = 21, so the probability of rolling a three is 3/21 = 1/7. The answer\
  \ is (B).\n\
  \nQ: To achieve a 0/1 loss estimate that is within 1 percent of the true 0/1\
  \ loss (with probability 95%), according to Hoeffding's inequality the IID test\
  \ set must have how many examples?\n(A) around 10 examples (B) around 100 examples\
  \ (C) between 100 and 500 examples (D) more than 1000 examples\nA: Let's think step\
  \ by step. By Hoeffding’s inequality, with probability 95% the in-sample and\
  \ out-of-sample errors differ by at most epsilon on N samples provided that\
  \ 2 exp(-2 epsilon^2 N) < 0.05. With epsilon = 0.01, this implies N > log(2/0.05)\
  \ / (2 epsilon^2) = 5000 log(40). Since log(40) is about 3.7, N must exceed roughly\
  \ 18,000, so one needs more than 1000 examples. The answer is (D).\n\n"
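# A minimal Python sketch, kept in comments so this task config remains valid YAML,
# double-checking the Add-1 smoothing arithmetic in the prompt above. It is
# illustrative only; the variable names are ours, not part of the evaluation template.
#
#   counts = [0, 1, 2, 3, 4, 5]            # observed rolls of sides 1..6 (15 total)
#   n, k = sum(counts), len(counts)        # n = 15 rolls, k = 6 sides
#   p_side3 = (counts[2] + 1) / (n + k)    # Add-1: (2 + 1) / (15 + 6) = 3/21 = 1/7
#   assert abs(p_side3 - 1 / 7) < 1e-12    # matches answer (B)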
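# A companion sketch, under the same caveats, for the Hoeffding sample-size bound:
# solving 2 * exp(-2 * eps**2 * N) < delta for N gives N > log(2 / delta) / (2 * eps**2).
#
#   import math
#   eps, delta = 0.01, 0.05                      # 1% deviation, 95% confidence
#   n_min = math.log(2 / delta) / (2 * eps**2)   # = 5000 * log(40), about 18444
#   assert n_min > 1000                          # matches answer (D)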
"group": "mmlu_flan_cot_fewshot_stem"
"include": "_mmlu_flan_cot_fewshot_template_yaml"
"task": "mmlu_flan_cot_fewshot_machine_learning"
